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Abstract 

Background: Sample size calculation is an important issue in the experimental design of biomedical research. For 
RNA-seq experiments, the sample size calculation method based on the Poisson model has been proposed; however, 
when there are biological replicates, RNA-seq data could exhibit variation significantly greater than the mean (i.e. 
over-dispersion). The Poisson model cannot appropriately model the over-dispersion, and in such cases, the negative 
binomial model has been used as a natural extension of the Poisson model. Because the field currently lacks a sample 
size calculation method based on the negative binomial model for assessing differential expression analysis of 
RNA-seq data, we propose a method to calculate the sample size. 

Results: We propose a sample size calculation method based on the exact test for assessing differential expression 
analysis of RNA-seq data. 

Conclusions: The proposed sample size calculation method is straightforward and not computationally intensive. 
Simulation studies to evaluate the performance of the proposed sample size method are presented; the results 
indicate our method works well, with achievement of desired power. 



Background 

Next generation sequencing (NGS) technology has revo- 
lutionized genetic analysis; RNA-seq is a powerful NGS 
method that enables researchers to discover, profile, and 
quantify RNA transcripts across the entire transcriptome. 
In addition, unlike the microarray chip, which offers only 
quantification of gene expression level, RNA-seq pro- 
vides expression level data as well as differentially spliced 
variants, gene fusion, and mutation profile data. Such 
advantages have gradually elevated RNA-seq as the tech- 
nology of choice among researchers. Nevertheless, the 
advantages of RNA-seq are not without computational 
cost; as compared to microarray analysis, RNA-seq data 
analysis is much more complicated and difficult. In the 
past several years, the published literature has addressed 
the application of RNA-seq to multiple research ques- 
tions, including abundance estimation [1-3], detection of 
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alternative splicing [4-6], detection of novel transcripts 
[6,7], and the biology associated with gene expression pro- 
file differences between samples [8-10]. With this rapid 
growth of RNA-seq applications, discussion of experi- 
mental design issues has lagged behind, though more 
recent literature has begun to address some of the rel- 
evant principles (e.g., randomization, replication, and 
blocking) to guide decisions in the RNA-seq framework 
[11,12]. 

One of the principal questions in designing an RNA- 
seq experiment is: What is the optimal number of bio- 
logical replicates to achieve desired statistical power? 
(Note: In this article, the term "sample size" is used 
to refer to the number of biological replicates or num- 
ber of subjects.) Because RNA-seq data are counts, the 
Poisson distribution has been widely used to model the 
number of reads obtained for each gene to identify dif- 
ferential gene expression [8,13]. Further, [12] used a 
Poisson distribution to model RNA-seq data and derive 
a sample size calculation formula based on the Wald 
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test for single-gene differential expression analysis. It is 
worth noting that a critical assumption of the Poisson 
model is that the mean and variance are equal This 
assumption may not hold, however, as read counts could 
exhibit variation significantly greater than the mean 
[14]. That is, the data are over-dispersed relative to the 
Poisson model In such cases, one natural alternative to 
Poisson is the negative binomial model Based on the 
negative binomial model, [14,15] proposed a quantile- 
adjusted conditional maximum likelihood procedure to 
create a pseudocount which lead to the development 
of an exact test for assessing the differential expres- 
sion analysis of RNA-seq data. Furthermore, [16] pro- 
vided a Bioconductor package, edgeR, based on the exact 
test. 

Sample size determination based on the exact test has 
not yet been studied, however. Therefore, the first goal of 
this paper is to propose a sample size calculation method 
based on the exact test. 

In reality, thousands of genes are examined in an 
RNA-seq experiment; differential expression among those 
genes is tested simultaneously, requiring the correc- 
tion of error rates for multiple comparisons. For the 
high-dimensional multiple testing problem, several such 
corrected measures have been proposed, such as family- 
wise error rate (FWER) and false discovery rate (FDR). 
In high-dimensional multiple testing circumstances, con- 
trolling FDR is preferable [17] because the Bonferroni 
correction for FWER is often too conservative [18]. 
Many methods have been proposed to control FDR in 
the analysis of high-dimensional data [17,19,20]. Those 
concepts have been extended to calculate sample size 
for microarray studies [21-25]. To our knowledge, how- 
ever, the literature does not address determination of 
sample size while controlling FDR in RNA-seq data. 
Therefore, the second purpose of this paper is to pro- 
pose a procedure to calculate sample size while control- 
ling FDR for differential expression analysis of RNA-seq 
data. 

In sum, in this article, we address the following two 
questions: (i) For a single-gene comparison, what is 
the minimum number of biological replicates needed 
to achieve a specified power for identifying differential 
gene expression between two groups? (ii) For multi- 
ple gene comparisons, what is the suitable sample size 
while controlling FDR? The article is organized as fol- 
lows. In the Method section, a sample size calculation 
method is proposed for a single-gene comparison. We 
then extend the method to address the multiple compar- 
ison test issue. Performance comparisons via numerical 
studies are described in the Results section. Two real 
RNA-seq data sets are used to illustrate sample size cal- 
culation. Finally, discussion follows in the Conclusions 
section. 



Method 

Exact test 

In an RNA-seq experiment, the total number of reads, also 
referred to as library size, mapped to the genome are dif- 
ferent among the samples. In such cases, the counts in 
each group are not identically distributed, and it is diffi- 
cult to develop an exact test for assessing the differential 
expression analysis of RNA-seq data. To handle this issue, 
[14,15] proposed a quantile- adjusted conditional maxi- 
mum likelihood procedure to create pseudocounts which 
are approximately identically distributed and which lead 
to the development of an exact test. In the following, 
the proposed sample size calculation method is based the 
exact test for a single-gene comparison. Let Yij be the ran- 
dom variable corresponding to the pseudocount, with yij 
being the observed value of Yjy, of the ;th (j = 1, 2, ... , n\) 
sample of the /th (i = 0, 1) group where no and n\ are 
the numbers of samples from the control and treatment 
group, respectively. Assume pseudocount Yjy can be mod- 
eled as a negative binomial (NB) distribution, NB^y/j, 0). 
Here, y; represents the normalized gene expression level 
of group /, dij represents a normalization factor for the 
total number of reads mapped in the ;th sample of the ith. 
group, and <p is the dispersion. We use the NB parame- 
terization where the mean is /x,y = dyyi and variance is 
/x,y(l + Because the question of interest is to iden- 

tify the differential gene expression between two groups, 
the corresponding testing hypothesis is 

H 0 :yi = yo vs. Hi : yi ^ Yo- U) 

Because the pseudocounts in each group have an 
approximately identical negative binomial distribution 
[14,15], the sum of pseudocounts of each group, 
Yi = YljLi Yij, has a negative binomial distribution 
NB(riidf yi, (f) / ni) where d* is the geometric mean of nor- 
malization factors in group /. Under the null hypothesis 
(1), the sum of the total pseudocount, Y\ + Yo, follows 
a negative binomial distribution. In analogy with Fishers 
exact test, [14,15] proposed an exact test for replac- 
ing the hypergeometric probabilities with negative bino- 
mial probabilities. Because [16] developed a Bioconductor 
software package edgeR which is an implementation of 
methodology developed by [14,15], the p- value can be 
easily calculated for conducting the exact test. 

In the following simulation and application sections, we 
used edgeR version 3.0.6 for estimating model parameters 
and performing the exact test. 

Sample size calculation for controlling type I error rate 

In this section, we focus on sample size calculation 
based on the exact test for a single-gene comparison as 
described in the test statistics section. For simplicity, we 
assume the RNA-seq experiment uses a balanced design 
(i.e., no = n\ = n), which is a special but common case. 
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The following method could be easily extended to the 
unbalanced case (i.e. let no = n and n\ = kn where k 
is a predetermined ratio of the sample size of the control 
group to the treatment group). In order to perform sample 
size calculations, it is necessary to construct a power func- 
tion for the testing described above. The power of a test 
is the probability that the null hypothesis is rejected when 
the alternative hypothesis is true. Since the distribution of 
the exact test statistic under the alternative hypothesis is 
unknown, however, it is difficult to derive a closed-form 
expression of the power function. Instead of deriving the 
distribution of test statistic under the alternative hypothe- 
sis, [26] proposed a method to calculate the power for the 
exact test based on a given j?-value. Here, we borrow their 
concept to calculate power. For a given ^-value, p(yi,yo) 
where yo and y\ are the observed pseudo-sums, described 
in the previous section, the power can be expressed as 

i=(n f p f fio f <p f w f <x) = 

oo oo 



EE/ 

„ — n „„ — n V 



yo—Oyi=0 



nwpno,- jf Infio,- )l(p(yi,yo) < a), 



where w = d\ld^ is the ratio of the geometric means of 
normalization factors between two groups, p = yi/yo is 
the fold change, /xn = oLq yo is the average number of reads 
in the control group, /(/x, 0) is the probability mass func- 
tion of the negative binomial distribution with mean fi 
as well as dispersion 0, a is the the level of significance, 
and /(.) denotes the indicator function. For a given desired 
power 1 — /3, the power of the test can be represented as 
the function of sample size in the form 



1 - P = %(n,p,fi 0 ,(t>,w,a). 



(2) 



Thus, the required sample size n to attain the given 
power 1 — at level of significance a can then be calcu- 
lated by solving (2) through a numerical approach, such as 
a gradient-search or bisection procedure. 

Sample size calculation for controlling false discovery rate 

In reality, thousands of genes are examined in an RNA- 
seq experiment, and those genes are tested simultaneously 
for significance of differential expression. In such cases, 
the sample size calculation for a single-gene comparison 
discussed above cannot be applied directly. Jung, 2005 
[23] incorporated FDR controlling based on a two-sample 
t-test under the Gaussian distribution assumption. In this 
section, we borrowed their concept to incorporate FDR 
controlling based on the test statistics described in the test 
statistics section. 

For the multiple testing problem, [19] suggested the 
use of false discovery rate (FDR) which is defined as the 
expected proportion of false discoveries among rejected 



null hypotheses. Storey, 2002 [17] further proposed an 
improvement to FDR to achieve higher power, in the form 



FDR: 



(flH< 



where Ro is the number of false discoveries and R is the 
number of results declared significant (i.e., rejections of 
the null hypothesis). 

To calculate the sample size for microarray data analysis, 
[23] proposed an FDR-controlled method which is based 
on the expression of FDR under independence (or weak 
dependence) among test statistics, as 

YHoOL 

FDR = , 

moot +E(Ri) 

[17,27], where mo is the number of true null hypotheses 
and E(R\) is the expected number of true rejections. By 
borrowing their concepts, the expected number of true 
rejections for RNA-seq data can be calculated as 

E(Rx) = ^ %(n,p g ,^og,(pg,w,oi), 
geMi 

where p g is the fold change, (p g is the dispersion, and flog 
is the average read count in the control group for gene 
g e Mi (the set of prognostic genes), respectively. Thus, 
to guarantee an expected number of true rejections, say 
ri, and control FDR at a specified level f, we have 
moot 



f = 



m 0 a + n 



(3) 



and 



ri = ^2 %(n,p g ,iiQ g ,(l) g ,w,a). (4) 
geMi 

By solving equation (3) with respect to a, we have 

a* = n ^ 

moil-fV 

where a* is the marginal type I error level for the expected 
number of true rejections r\ at a given FDR / Replacing a 
with a* in (4), we have the function with respect to n as 

g\(n) = ^2 %(n>Pg>V<Qg><l>g>w><x*) -n- 
geMi 

Then, by solving g\ (ri) = 0 via a numerical approach, 
the required sample size for controlling FDR at level/ can 
be obtained. 

To calculate the sample size, we have to estimate all 
of the fold changes p g , dispersions (j) g , and average read 
counts /jio g of gene g for the set of prognostic genes g e Mi 
prior to the RNA-seq experiment. However, we may not 
have enough information to estimate all of those param- 
eters in practice. To address this issue, we propose the 
following method to obtain a conservative estimate of 
the required sample size. Because the power increases as 
| log 2 (p^)| or jjio g increases and <p g decreases, we suggest 
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using a common p* = arg min^^ { I Ic^O^)!) minimum 
fold change, /Xq = mm ge M l {/^og} minimum average read 
count, and 0* = max^ G Mi{^} maximum dispersion to 
estimate each p g , /xq^, and 0^, respectively. In such cases, it 
gives a more conservative estimate of the required sample 
size. 

When we use p*, /Xq, and 0* to estimate each p g , /xq^, 
and <p g ,g £ Mi, in the multiple testing context, a* and /3* 
can be calculated as rj/ /(mo (1 — /)) and 1 — ri/rai, respec- 
tively, where mi is the number of prognostic genes. In 
other words, the power function (2) can be applied in the 
case of multiple gene comparison, with the replacement of 
a and ft with a* and ft*. 

The procedures for sample size calculation detailed in 
this section can be summarized as follows: 

1. Specify the following parameters: 

m : total number genes for testing; 
mi : number of prognostic genes; 
r\ : number of true rejections; 
f : FDR level; 

w : ratio of normalization factors between two 
groups; 

{f^o g ,g £ M\} : average read counts for prognostic 
gene g in control group; 

{p g ,g £ M\] : fold changes for prognostic genes g in 
control group; 

{(j) g ,g £ M{} : dispersion for prognostic genes g in 
control group; 

2. Calculate sample size: 

(a) If all the parameters fio g , p g , and (p g for each 
prognostic gene g are known, use a 
numerical approach to solve the equation 
below with respect to n. 



n = %(n,p g ,no g ,(j) g ,w,a*), 
geM x 



wherea* = ri//(mo(l — /)) and 
mo = m — mi; 
(b) Otherwise, 

(I) specify a desired minimum fold change 
p*, a minimum average read count /Xq, 
and a maximum dispersion 0*; 

(II) replace p = p*, /xo = /Xq, 0 = 0*, 
a = r]//(m 0 (l -/)), and 
P = 1 — ril mi in equation (2) and 
solve it with respect to n. 



Results 

Numerical studies 

In this section, we conducted simulation studies to eval- 
uate the accuracy of the proposed sample size formula. 
The parameter settings in simulation studies are based on 
empirical data sets. 

We set the total number of genes for testing to be m = 
10000 and the number of statistically significant prognos- 
tic genes mi = 100. We wanted to detect the expected 
number of true rejections r\ — 80, which corresponds to 
a power of 80% (i.e. ^* = 0.2). All parameters fio g , p g , 
and (j) g (g = 1, . . . , 10000) were assumed to be unknown. 
Thus, we used a minimum fold change p* and a mini- 
mum average read count /Xq and a maximum dispersion 
0* to estimate each p g) /xq^, and <p g , g = 1 . . . , 10000. 
We varied /Xq = 1 or 5; log2-fold changes log 2 (p*) = 
0.5, 1.0, 1.5, 2.0 or 2.5; and 0* = 0.1, or 0.5. With these set- 
tings, a* = 8.162 x 10~ 5 , 4.253 x 10~ 4 , and 8.979 x 10~ 4 , 
which correspond to controlling FDR at level 1%, 5%, and 
10%, respectively. 

Then, we substituted a* and ^* into the formulas (2) 
and calculated sample size by solving this equation. In 
addition, for each design setting, we generated 5000 sam- 
ples from independent negative binomial distributions 
based on the calculated sample size n; for the control 
group, the count of each gene is generated by R program 
from a negative binomial distribution with mean /Xq and 
dispersion 0*; for the treatment group, the count of each 
gene is generated from a negative binomial distribution 
with mean p*/xj and dispersion 0*. Then, edgeR is used 
to estimate model parameters and perform the exact test. 
The number of true rejections was counted using the q- 
value procedure proposed by [20]. The expected number 
of true rejections was estimated as the sample mean of the 
number of rejections of the 5000 simulation samples (ri). 

In Table 1, we showed the calculated sample size with 
corresponding ri in parentheses under the case w = 1. 
For a fixed log2-fold change, dispersion, and FDR, sam- 
ple size increases when /xo decreases. This result is as 
expected; a small average read count provides less infor- 
mation, such that a larger sample size is required to detect 
the difference. For a fixed /Xq, 0*, and FDR, sample size 
increases when log 2 (p*) decreases (i.e. the smaller log2- 
fold changes requires greater sample sizes with all else 
being equal). This result is as expected; a larger sample 
size is required for detecting a smaller difference. For a 
fixed /Xq, log 2 (p*), and FDR, sample size increases when 
0* increases. This result, also, is as expected; the variation 
increases when dispersion increases, such that a larger 
sample size is required to detect the difference. Note that 
all ?i in Table 1 are close to the pre-specified number of 
true rejections (ri = 80); thus, the proposed method esti- 
mated a sample size that achieves correct power at the 
specified FDR level. 
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Table 1 Sample size calculation for simulation study 
(and h ) with ri = 80 at FDR = 1 %, 5% and 1 0% when 
w= 1, m = 10000, rc?! = 100 









Hj = 1 






^ = 5 










FDR 






FDR 




log 2 (/>*) 


0* 


1% 


5% 


10% 


1% 


5% 


10% 


0.5 


0.1 


365 (81) 


305 (84) 


278 (88) 


104 (81) 


87 (84) 


79 (88) 




0.5 


518(81) 


433 (84) 


394 (88) 


257 (81) 


215 (84) 


196 (89) 


1.0 


0.1 


79 (81) 


67 (84) 


61 (87) 


24 (82) 


20 (84) 


19(91) 




0.5 


119(81) 


99 (83) 


91 (88) 


63 (82) 


53 (85) 


48 (89) 


1.5 


0.1 


31 (82) 


26 (83) 


24 (86) 


10(83) 


9(90) 


8(91) 




0.5 


49 (81) 


41 (83) 


38 (88) 


28 (83) 


23 (84) 


21 (86) 


2.0 


0.1 


16(85) 


1 3 (84) 


1 2 (86) 


6(90) 


5(92) 


4(86) 




0.5 


26 (82) 


22 (84) 


20 (86) 


1 6 (84) 


1 3 (85) 


1 2 (89) 


2.5 


0.1 


8(85) 


7(89) 


6(87) 


3(78) 


3(81) 


3(98) 




0.5 


14(83) 


1 2 (87) 


1 1 (84) 


1 0 (82) 


9(90) 


8(91) 



Applications 

Liver and kidney RNA-seq data set 

To identify differentially expressed genes between human 
liver and kidney RNA samples, [8] explored an RNA-seq 
data set containing 5 human kidney samples and 5 human 
liver samples. In the following, we used this data set as 
pilot data for designing a new study with the same study 
objective. For the purpose of demonstration, we assumed 
that the human kidney is the control group. After filtering 
genes with no more than 5 total reads in liver samples or 
kidney samples, there were 17306 genes left. We assumed 
that the top 175 ( ~ 1% of 17306) genes are prognostic. 
From the pilot data, the minimum average read counts 
among the prognostic genes in the control group were 
estimated as /Xq = 5, the maximum dispersion was esti- 
mated as 0* = 0.0029, and the ratio of the geometric 
mean of normalization factors between the two groups 
was estimated as w = 0.9 using edgeR. Suppose we want 
to identify 80% of the prognostic genes (i.e. r\ = 0.8 x 
175 = 140), while controlling FDR at 1% (i.e./ = 0.01). 
Based on the pilot data, we set m = 17306, m\ = 175, 
m 0 = 17131, ri = 140, and/ = 0.01. In this case, 
we have 



m 0 (l -/) 
= 8.2549 x 10" 5 

and 



= 0.2. 

After substituting those parameters into equation (2) 
and solving it with respect to n, the required sample 
size can be obtained. In the second column from the left 



of Table 2, we report the sample size while controlling 
FDR at 1% under various desired minimum fold changes 
p* = 0.10, 0.25, 0.50, 0.75, 1.25, 1.50, 2.00, 2.50, and 3.0. 
From Table 2, we found that the original RNA-seq experi- 
ment described in [8] with sample size 5 in each group can 
identify 80% of the prognostic genes at FDR= 1% if the 
desired minimum fold change p* is 3.0. 

Li, 2013 [28] proposed several sample size calculation 
methods for RNA-seq data under the Poisson model. To 
compare the difference in sample size calculation between 
the negative binomial method and Poisson method, in the 
last six right columns of Table 2 we report the sample 
size calculation based on Poisson model (i.e. the sam- 
ple size based on the Wald test n w , score test n s , log 
transformation of Wald statistic ni w , log transformation 
of score statistic n\ Si transformation of Poisson n tp , and 
likelihood ratio test n/ r ) with the same settings as the neg- 
ative binomial method. As we can see, the sample size 
calculation based on the negative binomial and Poisson 
methods are similar. This result is as expected since the 
data set explored by [8] has technical and not biological 
replicates (i.e. the maximum dispersion estimated from 
the liver and kidney RNA-seq data set is close to zero). 
Thus, it is not surprising that the results of the negative 
binomial and Poisson methods are similar when the dis- 
persion parameter is close to zero. Moreover, in Table 2, 
the estimated sample size is about the same size for very 
small fold changes (p* = 0.10) and very large fold changes 
(p* = 3.0). This result is expected since it tends to the same 
conclusion no matter what statistical model is used when 
the treatment effect is very large (i.e. the fold change is 
very large or small). 

Transcript regulation data set 

Blekhman, 2010 [29] used RNA-seq to study transcript 
regulation in humans, chimpanzees, and rhesus macaques 

Table 2 Sample size calculation for liver and kidney 
RNA-seq data set under various desired minimum fold 
changes (p*) for n= 140 at FDR= 1%whenm= 17360 
and m<\ = 175 



NB Poisson 



p* 


n 


n w 


n s 


ni w 


nis 




n !r 


0.10 


7 


7 


7 


11 


5 


5 


7 


0.25 


11 


11 


11 


13 


9 


9 


10 


0.50 


30 


29 


30 


31 


28 


27 


29 


0.75 


139 


134 


136 


137 


133 


132 


135 


1.25 


178 


175 


173 


174 


174 


177 


181 


1.50 


50 


49 


48 


49 


48 


50 


50 


2.00 


15 


15 


15 


15 


14 


16 


15 


2.50 


8 


8 


8 


8 


7 


8 


8 


3.00 


5 


5 


5 


6 


5 


6 


5 
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using liver RNA samples from three males and three 
females from each species. For the purpose of demon- 
stration, we assumed that the goal of the study is 
to identify differential gene expression between male 
and female in humans and that the female is consid- 
ered the control group. There were 13267 genes in 
the data set after performing quality control analyses. 
Suppose that the top 133 ( ^ 1% of 13267) genes are prog- 
nostic. After filtering genes with no more than 5 total 
reads in male samples or female samples, there were 7658 
genes left. Those genes are considered pilot data, and 
we assessed the differential expression by using edgeR. 
From the pilot data, the minimum average read counts 
among the prognostic genes in the control group were 
estimated as /Xq = 1.67, 0* = 0.6513, and the ratio 
of the geometric mean of normalization factors between 
the two groups was estimated as w = 1.08. Suppose we 
want to identify 80% of the prognostic genes (i.e. r\ = 
0.8 x 133 = 107), while controlling the FDR at 10%. 
Based on the pilot data, we set m = 13267, m\ = 133, 
mo = 13134, r\ = 107 and / = 0.1. In this case, we 
have a* = 9.0512 x 10~ 4 and = 0.2. In the second 
column from the left of Table 3, we report the required 
sample sizes under various desired minimum fold changes 
while controlling the FDR at 10% under the negative bino- 
mial distribution. We also report the required sample 
size based on the Poisson model proposed by [28] under 
the same settings in the last six columns on the right of 
Table 3. As we can see, the required sample size based on 
the negative binomial method is greater than the Poisson 
method. In the transcript regulation data set, the max- 
imum dispersion was estimated as 0* = 0.6513 > 0. 
This indicates that the read counts in this data set exhibit 
over-dispersion. In such a situation, it is inappropriate 
to model this data set based on the Poisson, and the 
sample size calculation based on the Poisson model will 

Table 3 Sample size calculation for transcript regulation 
data set under various desired minimum fold changes (/>*) 
forn = 107 at FDR= 10% when m = 13267 and m 1=133 



NB Poisson 



p* 


n 


n w 


n s 


"lw 


n/s 


ntp 


n, r 


0.10 


19 


15 


14 


21 


10 


10 


14 


0.25 


35 


23 


23 


26 


19 


19 


21 


0.50 


109 


62 


60 


62 


58 


56 


59 


0.75 


558 


284 


281 


282 


280 


273 


281 


1.25 


821 


316 


363 


366 


360 


371 


381 


1.50 


240 


100 


102 


103 


99 


105 


105 


2.00 


79 


30 


31 


32 


29 


32 


32 


2.50 


44 


16 


16 


18 


15 


17 


16 


3.00 


30 


10 


11 


12 


9 


11 


10 



be underestimated due to underestimation of variance 
(i.e. the study based on the corresponding sample size will 
be underpowered). 

Discussion 

In this research, we assume independent gene expression 
levels; however, this assumption may not hold in reality. 
For correlated RNA-seq gene expression data, evaluation 
of the accuracy of our method is an important future 
research question; however, generating a negative bino- 
mial distribution for correlated high-dimensional data will 
be a challenge. Moreover, most of the major R pack- 
ages dedicated to RNA-seq differential analyses (edgeR, 
DESeq, etc.) are now starting to enable multi-group com- 
parisons. However, the proposed method is developed for 
comparing two-group means. Thus, the sample size cal- 
culation for multi-group comparisons would be an inter- 
esting research topic for us in the future. In addition, 
it has already been noted that typical RNA-seq differ- 
ential analyses have very low power; see for example 
the simulation studies in [30], where power for edgeR 
was always less than 60%, or [31], where power ranged 
from about 45% to 55% (both with 10 samples per con- 
dition). In our simulation and application sections, the 
minimum sample sizes required to achieve 80% power 
would be prohibitively large for RNA-seq experiments 
in practice, given their current cost. In such situations, 
the findings in [30,31] can provide useful information 
for specifying achievable power. It is well known that 
low study power will decrease the reproducibility of 
scientific research. We hope that this paper can benefit 
researchers by allowing them to understand their study 
power. 

Conclusions 

In recent years, RNA-seq technology has emerged as 
an attractive alternative to microarray studies, due to 
its ability to produce digital signals (counts) rather than 
analog signals (intensities), and to produce more highly 
reproducible results with relatively little technical vari- 
ation [32,33]. With a large sample size, RNA-seq can 
become costly; on the other hand, insufficient sam- 
ple size may lead to unreliable answers to the research 
question of interest. To manage the trade-off between 
cost and accuracy, sample size determination is a crit- 
ical issue for RNA-seq experimental design. For com- 
paring the differential expression of a single gene, we 
have proposed a sample size calculation method based 
on an exact test proposed by [14,15]. To address mul- 
tiple testing (i.e., multiple genes), we further extended 
our proposed method to incorporate FDR control Our 
methods are not computationally intensive for pilot data 
or other relevant data with a specified desired minimum 
fold change, minimum average read count, and maximum 
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dispersion. To facilitate implementation of the sample size 
calculation, R code is available from the corresponding 
author. 
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